Async Post Load Test Fixes #36596
base: master
Conversation
Fixes two problems found in load testing: 1) We should not do any blocking operations while holding the lock in commit_finished_items; add items outside the lock. 2) Timers should be skewed so they don't all fire at once.
Summary of Changes (Gemini Code Assist): This pull request addresses performance and stability issues identified during load testing of asynchronous processing. The primary focus is on preventing contention by ensuring that blocking operations do not occur under locks, and on distributing the load of timer-triggered events by skewing firing times. These changes aim to improve the robustness and efficiency of the asynchronous DoFn under high-load scenarios.
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment
…e function does not define timeout.
Codecov Report ❌ Patch coverage is
Additional details and impacted files

@@             Coverage Diff              @@
##            master   #36596       +/-  ##
=============================================
- Coverage    56.93%   40.21%    -16.72%
  Complexity    3394     3394
=============================================
  Files         1222     1222
  Lines       186787   187021       +234
  Branches      3545     3545
=============================================
- Hits        106348    75214     -31134
- Misses       77067   108435     +31368
  Partials      3372     3372

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Assigning reviewers: R: @jrmccluskey for label python. Note: If you would like to opt out of this review, comment Available commands:
The PR bot will only process comments in the main thread (not review comments).
Run precommit
    coders.TupleCoder([coders.StrUtf8Coder(), coders.StrUtf8Coder()]),
    )
    _timer_frequency = 20
    coders.TupleCoder(
This is a backwards-incompatible change, since you're swapping to a different coder.
There shouldn't be any usage of this yet so I'm OK with that.
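To illustrate the compatibility concern in this thread with a toy sketch (these are stand-in encode/decode functions, not Beam's actual coder machinery): once state has been persisted under one encoding, a pipeline that decodes with a different scheme misreads it.

```python
import struct

# Toy stand-ins for two incompatible "coders" (not Beam's real coders).


def encode_old(value: str) -> bytes:
    # Old scheme: plain UTF-8 bytes.
    return value.encode("utf-8")


def decode_new(data: bytes) -> str:
    # New scheme: 4-byte big-endian length prefix, then UTF-8 bytes.
    (length,) = struct.unpack(">I", data[:4])
    return data[4:4 + length].decode("utf-8")


# State written with the old coder is silently misinterpreted by the new
# one, which is why swapping coders is backwards incompatible for any
# pipeline that has already persisted state.
```

Here the change is safe only because, as noted above, nothing has used the old coder yet.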
    self._parallelism = parallelism
    self._max_wait_time = max_wait_time
-   self._timer_frequency = 20
+   self._timer_frequency = callback_frequency
As best I can tell, self._timer_frequency is used but self.timer_frequency_ is not. Is there any reason to have both? Same goes for all of these duped fields.
    sleep_time = 0.01
    total_sleep = 0
    while not done:
      timeout = 1
The timeout duration could be configurable in __init__().
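A sketch of that suggestion: accept the poll timeout as a constructor argument instead of hardcoding `timeout = 1` in the wait loop. The class and parameter names (`AsyncWrapper`, `wait_timeout`, `is_done`) are assumptions for illustration, not the PR's actual signatures:

```python
import time


class AsyncWrapper:
    """Hypothetical sketch: poll timeout supplied via __init__."""

    def __init__(self, callback_frequency=20, wait_timeout=1.0):
        self._timer_frequency = callback_frequency
        self._wait_timeout = wait_timeout  # replaces the hardcoded timeout = 1

    def wait_until_done(self, is_done):
        """Poll is_done() until it returns True or the timeout elapses."""
        sleep_time = 0.01
        total_sleep = 0.0
        while not is_done():
            if total_sleep >= self._wait_timeout:
                return False  # timed out
            time.sleep(sleep_time)
            total_sleep += sleep_time
        return True
```

Making the timeout a parameter also lets tests use a short value instead of waiting out the production default.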
  def next_time_to_fire(self, key):
+   random.seed(key)
    return (
        floor((time() + self._timer_frequency) / self._timer_frequency) *
-       self._timer_frequency)
+       self._timer_frequency) + (
+           random.random() * self._timer_frequency)
I feel like doing all of the work to find a round increment of _timer_frequency is wasted compute once you add the extra fuzziness of random.random() * self._timer_frequency since you're no longer on a round increment afterwards
I started with just having each key set a timer at now + 10s. That doesn't work because as new work arrives the timer firing time keeps getting pushed out, i.e. an element arrives at t=1, we want to check back on it at t=11 so we set the timer, but then an element arrives at t=9 and overwrites the timer to t=19.
The next setup was a round-increment firing time, so any message that arrives between t=0 and t=10 sets the timer for t=10. That way the element at t=9 doesn't override the timer to t=19 but keeps it at t=10.
That works but means we see a spike of timers at t=10, t=20, t=30, etc. There isn't any reason the timers all need to fire at these round increments, so this attempts to add fuzzing per key (since timers are per key). Ideally any one key has buckets 10s apart, so the overwriting problem stays fixed, but across multiple keys the buckets don't all fire at the same time. I believe this is what the random.seed(key) on line 260 is doing, but correct me if I'm wrong.
Also, let me know if you know an easier way to obtain this pattern.
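One way to express the pattern described above as a standalone function (a sketch, not the PR's code; it uses a local `random.Random(key)` instance rather than reseeding the global RNG as the diff does, which gives the same per-key determinism without global side effects):

```python
import random
from math import floor
from time import time


def next_time_to_fire(key, timer_frequency=20):
    """Round up to the next bucket boundary, then add a deterministic
    per-key offset: the same key always lands in a stable slot within
    its bucket, while different keys spread out across the bucket."""
    rng = random.Random(key)  # per-key seed -> stable jitter for this key
    bucket_start = (
        floor((time() + timer_frequency) / timer_frequency) * timer_frequency)
    return bucket_start + rng.random() * timer_frequency
```

Because the jitter is a pure function of the key, a late element for the same key recomputes the same firing time instead of pushing it out, which is the overwriting problem the thread describes.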
Fixes two problems found in async during load testing:
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
- Update CHANGES.md with noteworthy changes.
See the Contributor Guide for more tips on how to make the review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.